Missing Data and Imputation

Javier Estrada

Michael Underwood

Elizabeth Subject-Scott

What is Missing Data?

  • Missing data occurs when no value is recorded for a variable in one or more observations
    • Can be intentional or unintentional
  • Missing data is classified into 3 different categories:
    • Missing Completely At Random (MCAR)
    • Missing At Random (MAR)
    • Missing Not At Random (MNAR)

X are the completely observed variables.

Y are the partly missing variables.

Z is the component of the cause of missingness unrelated to X and Y.

R is the missingness.
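The three mechanisms can be illustrated with a small simulation. This is only a sketch under assumed settings: the variable names, the missingness probabilities (0.3, 0.5, 0.1), and the linear relation between X and Y are made up for illustration and do not come from any dataset in this presentation.

```python
import random

def simulate_missingness(n=10_000, seed=0):
    """Return rows (x, y, r_mcar, r_mar, r_mnar); r = 1 means Y is missing."""
    rng = random.Random(seed)
    rows = []
    for _ in range(n):
        x = rng.gauss(0, 1)          # X: completely observed
        y = x + rng.gauss(0, 1)      # Y: partly missing, correlated with X
        # MCAR: missingness depends on neither X nor Y (only on Z, pure chance)
        r_mcar = int(rng.random() < 0.3)
        # MAR: missingness depends only on the observed variable X
        r_mar = int(rng.random() < (0.5 if x > 0 else 0.1))
        # MNAR: missingness depends on the unobserved value of Y itself
        r_mnar = int(rng.random() < (0.5 if y > 0 else 0.1))
        rows.append((x, y, r_mcar, r_mar, r_mnar))
    return rows

def observed_mean_y(rows, r_index):
    """Mean of Y over the rows where the chosen indicator marks Y as observed."""
    kept = [row[1] for row in rows if row[r_index] == 0]
    return sum(kept) / len(kept)

rows = simulate_missingness()
full_mean = sum(row[1] for row in rows) / len(rows)
means = {name: observed_mean_y(rows, idx)
         for name, idx in [("MCAR", 2), ("MAR", 3), ("MNAR", 4)]}

print("full data:", round(full_mean, 3))
for name, value in means.items():
    print(name, "observed only:", round(value, 3))
```

In this setup the mean of the observed Y values stays close to the full-data mean under MCAR, but drifts away from it under MAR and MNAR, because large values of Y are more likely to be missing.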

Methods to Handle Missing Data

  • Likelihood Bayesian Method
    • Predicts missing values based on a previous predictive distribution.
  • Weighting Method
    • Uses weights from available data to adjust for missing values.
  • Imputation Method
    • Uses estimates from original data to determine missing values.

Deleting Missing Data

  • When type is MCAR and the amount of missing data is small, deletion can be used.

  • 2 Types

    • Listwise deletion occurs when the entire observation is removed.

    • Pairwise deletion occurs when only the missing value is excluded, so the observation is still used in any analysis that involves its other, observed variables.

  • Deleting missing data can lead to the loss of important information regarding your dataset and is not recommended.
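The difference between the two deletion types can be sketched on a toy table. The rows and the column names (`age`, `income`, `score`) are hypothetical and serve only to show how many observations each approach keeps.

```python
# Hypothetical records; None marks a missing value.
data = [
    {"age": 25, "income": 40.0, "score": 3.1},
    {"age": 31, "income": None, "score": 3.8},
    {"age": 47, "income": 72.0, "score": None},
    {"age": 52, "income": 65.0, "score": 4.4},
    {"age": None, "income": 58.0, "score": 4.0},
]

def listwise(rows):
    """Keep only observations with no missing value in any variable."""
    return [r for r in rows if all(v is not None for v in r.values())]

def pairwise(rows, a, b):
    """Keep every observation where both variables a and b are observed."""
    return [(r[a], r[b]) for r in rows
            if r[a] is not None and r[b] is not None]

complete = listwise(data)
age_income = pairwise(data, "age", "income")
income_score = pairwise(data, "income", "score")

print("listwise rows:", len(complete))
print("pairwise rows for (age, income):", len(age_income))
print("pairwise rows for (income, score):", len(income_score))
```

Listwise deletion keeps 2 of the 5 rows here, while each pairwise analysis keeps 3. Note that the pairwise analyses are based on different subsets of rows.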

Preferred Method:

  • Imputation

    • 2 Types

      • Single Imputation

        • Only one estimate is used to replace the missing data.
      • Multiple Imputation

        • Various estimates are used to replace the missing data by creating multiple versions of the original dataset.

Single or Univariate Imputation

  • Methods include:

    • Using the mean to replace a missing value.
      • The problem with this method is that it reduces the variance, which leads to artificially narrow confidence intervals.
    • Last Observation Carried Forward (LOCF) replaces a missing value with a previously observed value (the most recent value is carried forward).
      • The problem with this method is that it assumes the value has not changed since it was last observed, when in reality that may not be the case.
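Both single imputation methods are short enough to sketch directly. The series below is a made-up example, with `None` marking a missing value.

```python
from statistics import mean, variance

def mean_impute(values):
    """Replace every missing value with the mean of the observed values."""
    observed = [v for v in values if v is not None]
    fill = mean(observed)
    return [fill if v is None else v for v in values]

def locf(values):
    """Last Observation Carried Forward: reuse the most recent observed value."""
    out, last = [], None
    for v in values:
        if v is not None:
            last = v
        out.append(last)  # stays None if nothing has been observed yet
    return out

series = [4.0, None, 6.0, None, None, 10.0]
observed = [v for v in series if v is not None]

mean_filled = mean_impute(series)
locf_filled = locf(series)

print("LOCF:", locf_filled)
print("variance of observed values:", round(variance(observed), 3))
print("variance after mean imputation:", round(variance(mean_filled), 3))
```

LOCF returns `[4.0, 4.0, 6.0, 6.0, 6.0, 10.0]`. The sample variance after mean imputation is smaller than the variance of the observed values alone, which is the variance reduction described above.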

Multiple Imputation

  • A set of M plausible values is generated for each unobserved data point, resulting in M complete datasets.
  • The new values are randomly drawn from predictive distributions through either joint modeling (JM) or fully conditional specification (FCS).
  • Each completed dataset is then analyzed, and the results are combined, or pooled, to obtain a single estimate of the parameter of interest.
  • Multiple imputation by chained equations (MICE) is the most common and preferred method of multiple imputation.

Rubin’s Rules: Average the estimates across the M imputed datasets.

Calculate the within-imputation and between-imputation variance of the M estimates.

Combine the two variances using an adjustment term (1+1/M).
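The impute, analyze, and pool cycle can be sketched end to end for a single incomplete variable. This is not MICE: it is a simplified illustration that assumes one predictor X, one incomplete variable Y, a linear model, and a bootstrap of the observed cases as a stand-in for drawing the regression parameters. The function name `multiply_impute_mean`, the simulated data, and all numeric settings are hypothetical.

```python
import random
from statistics import mean, variance

def multiply_impute_mean(xs, ys, M=20, seed=0):
    """Estimate the mean of Y from M imputed datasets, pooled with Rubin's Rules."""
    rng = random.Random(seed)
    obs = [(x, y) for x, y in zip(xs, ys) if y is not None]
    estimates, variances = [], []
    for _ in range(M):
        # Bootstrap the observed cases so that each imputation uses slightly
        # different regression parameters (assumed approximation).
        boot = [rng.choice(obs) for _ in obs]
        mx = mean(p[0] for p in boot)
        my = mean(p[1] for p in boot)
        sxx = sum((x - mx) ** 2 for x, _ in boot)
        slope = sum((x - mx) * (y - my) for x, y in boot) / sxx
        intercept = my - slope * mx
        rss = sum((y - intercept - slope * x) ** 2 for x, y in boot)
        sigma = (rss / (len(boot) - 2)) ** 0.5
        # Impute: predicted value plus a random residual.
        completed = [y if y is not None
                     else intercept + slope * x + rng.gauss(0, sigma)
                     for x, y in zip(xs, ys)]
        # Analyze: the estimate is the mean of Y, with its sampling variance.
        estimates.append(mean(completed))
        variances.append(variance(completed) / len(completed))
    # Pool: average estimate, within variance plus inflated between variance.
    pooled = mean(estimates)
    total_var = mean(variances) + (1 + 1 / M) * variance(estimates)
    return pooled, total_var

# Simulated MAR data: Y is more likely to be missing when X is positive.
rng = random.Random(42)
xs = [rng.gauss(0, 1) for _ in range(2000)]
ys_full = [1 + x + rng.gauss(0, 1) for x in xs]
ys = [None if rng.random() < (0.6 if x > 0 else 0.1) else y
      for x, y in zip(xs, ys_full)]

full_mean = mean(ys_full)
cc_mean = mean(y for y in ys if y is not None)
pooled, total_var = multiply_impute_mean(xs, ys)

print("full-data mean:", round(full_mean, 3))
print("complete-case mean:", round(cc_mean, 3))
print("pooled MI mean:", round(pooled, 3), "SE:", round(total_var ** 0.5, 3))
```

In this simulation the pooled estimate lands much closer to the full-data mean than the complete-case mean does, because the imputation model uses X, which drives the missingness.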

Other Methods of Imputation

  • Regression Imputation is based on a linear regression model. Missing data is randomly drawn from a conditional distribution when variables are continuous and from a logistic regression model when they are categorical.

  • Predictive Mean Matching is also based on a linear regression model. The approach is the same as regression imputation, except that instead of a random draw from a conditional distribution, each missing value is replaced by an observed value taken from a case whose predicted value is close to the prediction for the missing case.

  • Hot Deck (HD) imputation is when a missing value is replaced by an observed response of a similar unit, also known as the donor. The donor can be selected at random or deterministically, based on a metric or value. It does not rely on model fitting.

  • Stochastic Regression (SR) Imputation is an extension of regression imputation. The process is the same, but a residual term drawn from a normal distribution, with variance equal to the residual variance of the regression, is added to the predicted value. This maintains the variability of the data.

  • Random Forest (RF) Imputation is based on machine learning algorithms. Missing values are first replaced with the mean or mode of that particular variable, and then the dataset is split into a training set (rows where the variable is observed) and a prediction set (rows where it is missing). A random forest fitted on the training set then predicts replacements for the missing values. This type of imputation can be used on continuous or categorical variables with complex interactions.
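Three of the methods above can be compared side by side in a small sketch. Several simplifications are assumed here: a single predictor, simulated data, a deterministic "regression" baseline that uses the predicted value only, and a predictive mean matching variant that takes the single nearest donor rather than sampling from several candidates. The helper names `fit_line` and `impute` are hypothetical.

```python
import random
from statistics import mean, variance

def fit_line(xs, ys):
    """Ordinary least squares for y = a + b*x; also returns the residual SD."""
    mx, my = mean(xs), mean(ys)
    sxx = sum((x - mx) ** 2 for x in xs)
    b = sum((x - mx) * (y - my) for x, y in zip(xs, ys)) / sxx
    a = my - b * mx
    rss = sum((y - a - b * x) ** 2 for x, y in zip(xs, ys))
    return a, b, (rss / (len(xs) - 2)) ** 0.5

def impute(xs, ys, method, rng):
    """Fill missing ys (None) with 'regression', 'stochastic', or 'pmm'."""
    obs = [(x, y) for x, y in zip(xs, ys) if y is not None]
    a, b, sigma = fit_line([p[0] for p in obs], [p[1] for p in obs])
    donors = [(a + b * x, y) for x, y in obs]  # (predicted, observed) pairs
    out = []
    for x, y in zip(xs, ys):
        pred = a + b * x
        if y is not None:
            out.append(y)
        elif method == "regression":
            out.append(pred)                        # predicted value only
        elif method == "stochastic":
            out.append(pred + rng.gauss(0, sigma))  # prediction plus residual
        elif method == "pmm":
            # Borrow the observed value of the donor with the closest prediction.
            out.append(min(donors, key=lambda d: abs(d[0] - pred))[1])
        else:
            raise ValueError(f"unknown method: {method}")
    return out

rng = random.Random(1)
xs = [rng.gauss(0, 1) for _ in range(2000)]
ys_full = [2 + 1.5 * x + rng.gauss(0, 1) for x in xs]
ys = [None if rng.random() < 0.3 else y for y in ys_full]  # about 30% MCAR

completed = {m: impute(xs, ys, m, rng)
             for m in ("regression", "stochastic", "pmm")}
for m, vals in completed.items():
    print(m, "variance:", round(variance(vals), 3))
```

The deterministic baseline gives a smaller variance than the stochastic version, since its imputed values sit exactly on the regression line. Predictive mean matching only ever imputes values that were actually observed.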

Methodology

Under Rubin’s Rules, multiple imputation creates M imputed values for each missing data point, resulting in M complete datasets. For each of the M datasets, an estimate of \(\theta\) is acquired. Let \({\hat{\theta}}_{m}\) be the estimate of \(\theta\), and \({\hat{\phi}}_{m}\) an estimator of the variance of \({\hat{\theta}}_{m}\), based on the mth complete dataset, for m = 1, ..., M.

The combined estimator of \(\theta\) is given by:

\[{\hat{\theta}}_{M} = \displaystyle \frac{1}{M}\sum_{m = 1}^{M} {\hat{\theta}}_{m}\]

The proposed variance estimator of \({\hat{\theta}}_{M}\) is given by:

\[{\hat{\Phi}}_{M} = {\overline{\phi}}_{M} + (1+\displaystyle \frac{1}{M})B_{M}\]

with a correction factor of \((1+\displaystyle \frac{1}{M})\),

where the average within imputation variance is:

\[{\overline{\phi}}_{M} = \displaystyle \frac{1}{M}\sum_{m = 1}^{M}{\hat{\phi}}_m\]

and the between imputation variance is:

\[B_{M} = \displaystyle \frac{1}{M-1}\sum_{m = 1}^{M}({\hat{\theta}}_{m}-{\hat{\theta}}_{M})^{2}\]
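The pooling formulas translate almost line for line into code. The sketch below assumes the M estimates and their variances have already been computed from the imputed datasets; the function name `pool_rubin` and the example numbers are hypothetical.

```python
from statistics import mean, variance

def pool_rubin(estimates, variances):
    """Pool M estimates and their variances with Rubin's Rules.

    Returns (pooled estimate, within variance, between variance, total variance).
    """
    M = len(estimates)
    if M < 2 or len(variances) != M:
        raise ValueError("need M >= 2 estimates and one variance per estimate")
    theta = mean(estimates)          # combined estimator of theta
    within = mean(variances)         # average within-imputation variance
    between = variance(estimates)    # between-imputation variance, divides by M - 1
    total = within + (1 + 1 / M) * between
    return theta, within, between, total

# Made-up estimates from M = 5 imputed datasets.
estimates = [2.0, 2.4, 2.2, 1.8, 2.1]
variances = [0.10, 0.12, 0.09, 0.11, 0.08]

theta, within, between, total = pool_rubin(estimates, variances)
print("pooled estimate:", round(theta, 4))
print("within:", round(within, 4), "between:", round(between, 4))
print("total variance:", round(total, 4), "SE:", round(total ** 0.5, 4))
```

For these numbers the pooled estimate is 2.1, the within variance is 0.10, the between variance is 0.05, and the total variance is 0.10 + (1 + 1/5)(0.05) = 0.16.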

Assumptions of Multiple Imputation

  • Observed data follow a multivariate normal distribution.

  • Missing data are classified as MAR, meaning the probability that a value is missing depends only on observed values and not on unobserved values.

  • The parameters \({\theta}\) of the data model and the parameters \({\phi}\) of the model for the missing values are distinct. That is, knowing the values of \({\theta}\) does not provide any information about \({\phi}\).

MICE in R

Dataset

Visualizations

Results

Conclusion

  • Missing data can occur in research for a variety of reasons.
  • Ignoring missing data is never a good idea. Doing so can lead to biased parameter estimates, loss of information, decreased statistical power, and weak reliability of findings.
  • The best course of action is to impute the missing data by using multiple imputation.
  • Performing multiple imputation will minimize the adverse effects caused by missing data on the analysis.